Clement Lee (Newcastle University)
2025-02-04 (Tue)
Introduction
Practical 1
Data pipeline via the tidyverse
Practical 2
Make documents / presentations in a reproducible way
Practical 3
Q & A, wrap-up
Today’s slides: https://bit.ly/insights-north-east
Toolkits on https://insightsnortheast.co.uk/data/?
Lecturer in Statistics at Newcastle University since August 2022
Teach statistical analysis using R
Research on statistical models for (social) networks
Write in the chat
Experience in working with data, and what are your usual tools?
Have you coded in R (or other programming languages) before?
Future-proof your analysis
for others to replicate / reproduce
including your future self
ChatGPT & others making it easier to start
Also the stepping stone for machine learning and AI
but don’t start with something completely unfamiliar
garbage in, garbage out
Version control, CI/CD, and beyond
“But I need to make presentations, reports, visualisations, etc.”
Most can be done nicely via R
We will see how this can be done
Python great for big data pipelines
R is better in some other aspects
tidyverseUnderstand which one is better / more suited for your tasks
Free & open source
Download from https://cran.r-project.org/
Anyone can contribute:
https://cran.r-project.org/web/packages/available_packages_by_name.html
Some packages about “business”, “politics”, etc.
The (best) interface for using R
Download from https://posit.co/download/rstudio-desktop/
Let’s have a look
## # A tibble: 11,714 × 19
## Year start end property code street address
## <dbl> <dttm> <lgl> <chr> <chr> <chr> <chr>
## 1 2024 2017-02-10 00:00:00 NA N000080 CW 582 Brunswick Park Industr…
## 2 2024 2020-08-01 00:00:00 NA N000082 IM3 582 Brunswick Park Industr…
## 3 2024 2014-11-28 00:00:00 NA N000085 CW 582 Antique Pine Imports, …
## 4 2024 2016-06-10 00:00:00 NA N000087 IF 582 Brunswick Park Industr…
## 5 2024 1995-04-01 00:00:00 NA N000088 IF 582 9, Brunswick Park Indu…
## 6 2024 2015-08-05 00:00:00 NA N000090 CW 582 10, Brunswick Park Ind…
## 7 2024 1995-04-01 00:00:00 NA N000093 IF3 582 Brunswick Park Industr…
## 8 2024 1995-04-01 00:00:00 NA N000094 IF3 583 13, Brunswick Village,…
## 9 2024 1995-04-01 00:00:00 NA N000095 IF3 582 Brunswick Park Industr…
## 10 2024 2019-01-01 00:00:00 NA N000096 CW 12979 Unit 15, Brunswick Par…
## # ℹ 11,704 more rows
## # ℹ 12 more variables: `NEW foi_liable party` <chr>, `NEW c/o address` <chr>,
## # `rateable value` <dbl>, LIA <dbl>, TRL <dbl>, SBR <dbl>, Chr <dbl>,
## # s44a <lgl>, EXM <dbl>, DRR1 <lgl>, DRR2 <dbl>, `net charges` <dbl>
A table / spreadsheet in Excel (.xls, .xlsx, .csv, …) \(\approx\) a data frame / tibble in R
## [1] 28322.39
## [1] 78594.26
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 419 10380 3898 3598140
hist(table1$`net charges`,
breaks = 100, # approximately 100 bars
main = NA, xlab = NA) # don't print the title and the x-axis label yet
title(main = "Histogram of net charges for November 2024",
xlab = "Net charges (£)")plot(table1$start, table1$`net charges`, log = "y", # on log scale
main = NA, xlab = NA, ylab = NA, # don't print title & axis labels yet
cex = 0.1) # size of the points
title(main = "Net charges over the starting date of businesses",
xlab = "Starting date", ylab = "Net charges (£)")What if we have an xls or csv file instead?
read_xls() from the same package readxl
read.csv() that comes with R
What if we want to create new columns (by manipulating some existing columns) and plot them?
Tedious if we need to
create them first,
re-save the spreadsheet,
and then rerun the analysis
It is usually the best not to modify the raw / original data.
Plots still not perfect
Huge skew of net charges \(\rightarrow\) uninformative histogram; perhaps log scale help?
Faffy modifications; instead, we will use the package ggplot2
We can build a data pipeline within R to
read the data,
manipulate them, and
make very nice plots
Tidyverse (https://www.tidyverse.org/), arguably the biggest development in R ever
ggplot2 is part of itGo through the practical in groups
Make summaries and plots using your own data or the business rates data
Note any challenges / difficulties, or anything you find useful
Report your findings
Continue with the previous data frame table1
Let’s say we want to calculate the mean of the square root of the net charges
## [1] 48.93841
The way the functions are typed is not that the way we think about applying them
The code below is more linear
## [1] 48.93841
Collection of R packages designed for data science
All packages share an underlying design philosophy, grammar, and data structures
Installation: only need to do it once (theoretically)
The above might require a lot more packages to be installed
After all, it’s just a collection of packages
install.packages("dplyr")
install.packages("ggplot2")
install.packages("tidyr")
install.packages("readr")
install.packages("stringr")
install.packages("tibble")
install.packages("purrr")
install.packages("forcats")Type above commands once and first
Type below commands every time you need them
library(tidyverse)dplyrtable2 <- table1 |>
mutate(date = as.Date(start), ratio = `net charges` / LIA) |>
select(date, ratio, `rateable value`:`net charges`, "address") # select / reorder columns
table2## # A tibble: 11,714 × 13
## date ratio `rateable value` LIA TRL SBR Chr s44a EXM
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl>
## 1 2017-02-10 0.250 45750 22829. NA NA NA NA NA
## 2 2020-08-01 1 35500 17714. NA NA NA NA NA
## 3 2014-11-28 0.2 11250 5614. NA NA -4491 NA NA
## 4 2016-06-10 0.250 16500 8234. NA NA NA NA NA
## 5 1995-04-01 1 21000 10479 NA NA NA NA NA
## 6 2015-08-05 1 52000 28392 NA NA NA NA NA
## 7 1995-04-01 0.0417 12500 6238. NA -5198. NA NA NA
## 8 1995-04-01 0.339 13250 6612. NA -3857. NA NA NA
## 9 1995-04-01 0 11000 5489 NA -5489 NA NA NA
## 10 2019-01-01 1 34250 17091. NA NA NA NA NA
## # ℹ 11,704 more rows
## # ℹ 4 more variables: DRR1 <lgl>, DRR2 <dbl>, `net charges` <dbl>,
## # address <chr>
dplyrFilter rows where DRR1 is not empty (NA in R)
Other uses (not shown):
Join other tables
Summarise columns
## # A tibble: 3 × 13
## date ratio `rateable value` LIA TRL SBR Chr s44a EXM DRR1
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl> <lgl>
## 1 1995-04-01 0.814 272500 148785 NA NA NA NA NA TRUE
## 2 2021-03-09 0.911 47500 23702. NA NA NA NA NA TRUE
## 3 2005-08-21 1 3060000 1670760 0 NA NA NA NA FALSE
## # ℹ 3 more variables: DRR2 <dbl>, `net charges` <dbl>, address <chr>
tidyr (+ dplyr)Pivot tables
Combine columns between “rateable value” & “net charges”
New columns “name” and “value”
table3 <- table2 |>
pivot_longer(`rateable value`:`net charges`) |>
select(date, ratio, name, value, everything()) |> # a dplyr function
filter(!is.na(value)) # another dplyr function
table3## # A tibble: 46,557 × 5
## date ratio name value address
## <date> <dbl> <chr> <dbl> <chr>
## 1 2017-02-10 0.250 rateable value 45750 Brunswick Park Industrial Estate, Br…
## 2 2017-02-10 0.250 LIA 22829. Brunswick Park Industrial Estate, Br…
## 3 2017-02-10 0.250 DRR2 -17122. Brunswick Park Industrial Estate, Br…
## 4 2017-02-10 0.250 net charges 5707. Brunswick Park Industrial Estate, Br…
## 5 2020-08-01 1 rateable value 35500 Brunswick Park Industrial Estate, Br…
## 6 2020-08-01 1 LIA 17714. Brunswick Park Industrial Estate, Br…
## 7 2020-08-01 1 net charges 17714. Brunswick Park Industrial Estate, Br…
## 8 2014-11-28 0.2 rateable value 11250 Antique Pine Imports, Brunswick Park…
## 9 2014-11-28 0.2 LIA 5614. Antique Pine Imports, Brunswick Park…
## 10 2014-11-28 0.2 Chr -4491 Antique Pine Imports, Brunswick Park…
## # ℹ 46,547 more rows
stringr (+ dplyr)Extract postcode
Useful if we want to combine with coordinates for plotting on a map
table3 |>
mutate(postcode = str_extract(address, "NE\\d+ \\d\\w{2}")) |> # within a dplyr function
select(-address) # postcode suffices for finding the coordinates## # A tibble: 46,557 × 5
## date ratio name value postcode
## <date> <dbl> <chr> <dbl> <chr>
## 1 2017-02-10 0.250 rateable value 45750 NE13 7BA
## 2 2017-02-10 0.250 LIA 22829. NE13 7BA
## 3 2017-02-10 0.250 DRR2 -17122. NE13 7BA
## 4 2017-02-10 0.250 net charges 5707. NE13 7BA
## 5 2020-08-01 1 rateable value 35500 NE13 7BA
## 6 2020-08-01 1 LIA 17714. NE13 7BA
## 7 2020-08-01 1 net charges 17714. NE13 7BA
## 8 2014-11-28 0.2 rateable value 11250 NE13 7BA
## 9 2014-11-28 0.2 LIA 5614. NE13 7BA
## 10 2014-11-28 0.2 Chr -4491 NE13 7BA
## # ℹ 46,547 more rows
ggplot2Histogram of new column ratio we have created
Always the same syntax, as we will see
ggplot2Density of new column ratio we have created
Change the background colour and label size
table1 |>
mutate(date = as.Date(start), ratio = `net charges` / LIA) |>
select(date, ratio, `rateable value`:`net charges`, "address") |>
ggplot() +
geom_density(aes(ratio)) +
theme_bw(18)table1 |>
mutate(date = as.Date(start), ratio = `net charges` / LIA) |>
select(date, ratio, `rateable value`:`net charges`, "address") |>
ggplot() +
geom_point(aes(date, ratio)) +
theme_bw(18)table1 |>
ggplot() +
geom_histogram(aes(`net charges`)) +
scale_y_log10() +
scale_x_continuous(labels = scales::comma) +
theme_bw(18) +
labs(title = "Histogram of net charges for November 2024", x = "Net charges (£)")table1 <- read_xlsx("Business Rates properties - November 2024.xlsx")
table1 |>
ggplot() +
geom_histogram(aes(`net charges`)) +
scale_y_log10() +
scale_x_continuous(labels = scales::comma) +
theme_bw(18) +
labs(title = "Histogram of net charges for November 2024", x = "Net charges (£)")month_year <- "November 2024"
table1 <- read_xlsx(paste0("Business Rates properties - ", month_year, ".xlsx"))
table1 |>
ggplot() +
geom_histogram(aes(`net charges`)) +
scale_y_log10() +
scale_x_continuous(labels = scales::comma) +
theme_bw(18) +
labs(title = paste0("Histogram of net charges for ", month_year), x = "Net charges (£)")month_year <- "August 2024"
table1 <- read_xlsx(paste0("Business Rates properties - ", month_year, ".xlsx"))
table1 |>
ggplot() +
geom_histogram(aes(`net charges`)) +
scale_y_log10() +
scale_x_continuous(labels = scales::comma) +
theme_bw(18) +
labs(title = paste0("Histogram of net charges for ", month_year), x = "Net charges (£)")month_year <- "May 2024"
table1 <- read_xlsx(paste0("Business Rates properties - ", month_year, ".xlsx"))
table1 |>
ggplot() +
geom_histogram(aes(`net charges`)) +
scale_y_log10() +
scale_x_continuous(labels = scales::comma) +
theme_bw(18) +
labs(title = paste0("Histogram of net charges for ", month_year), x = "Net charges (£)")Install the packages first if you haven’t already
Look at the examples in the slides, and try to do the same for your data
Look at the package manuals for inspiration: https://www.tidyverse.org/packages/
If you know what plots you want but not the code, ask
Note anything useful / challenging, and report your findings